We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decodert takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
translated by 谷歌翻译
Mix-up training approaches have proven to be effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community expands mix-up methods into two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the arbitrary path, leaving the randomization domain unexplored. In this paper, inspired by the superior qualities of each direction over one another, we introduce a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, our method balances speed, simplicity, and accuracy. We name our method R-Mix following the concept of "Random Mix-up". We demonstrate its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. Finally, in order to address the question of whether there exists a better decision protocol, we train a Reinforcement Learning agent that decides the mix-up policies based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for a fully automatic mix-up. Our code is released at [https://github.com/minhlong94/Random-Mixup].
translated by 谷歌翻译
Our long term goal is to use image-based depth completion to quickly create 3D models from sparse point clouds, e.g. from SfM or SLAM. Much progress has been made in depth completion. However, most current works assume well distributed samples of known depth, e.g. Lidar or random uniform sampling, and perform poorly on uneven samples, such as from keypoints, due to the large unsampled regions. To address this problem, we extend CSPN with multiscale prediction and a dilated kernel, leading to much better completion of keypoint-sampled depth. We also show that a model trained on NYUv2 creates surprisingly good point clouds on ETH3D by completing sparse SfM points.
translated by 谷歌翻译
Multilayer perceptrons (MLPs) learn high frequencies slowly. Recent approaches encode features in spatial bins to improve speed of learning details, but at the cost of larger model size and loss of continuity. Instead, we propose to encode features in bins of Fourier features that are commonly used for positional encoding. We call these Quantized Fourier Features (QFF). As a naturally multiresolution and periodic representation, our experiments show that using QFF can result in smaller model size, faster training, and better quality outputs for several applications, including Neural Image Representations (NIR), Neural Radiance Field (NeRF) and Signed Distance Function (SDF) modeling. QFF are easy to code, fast to compute, and serve as a simple drop-in addition to many neural field representations.
translated by 谷歌翻译
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
translated by 谷歌翻译
Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.
translated by 谷歌翻译
We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input. Specifically, we define two sets of latent codes, where one controls a pre-defined factor (``known''), and the other controls the remaining factors (``unknown''). The sampled latent codes from the two sets jointly bi-modulate the convolution kernels to guide the generator to synthesize diverse results. Experiments demonstrate the superiority of our method over state-of-the-arts in result diversity and generation controllability.
translated by 谷歌翻译
我们引入了一个新颖的框架,用于连续的面部运动脱毛,该框架通过矩控制因子恢复单个运动毛面脸部图像中潜在的连续锋利力矩。尽管动作毛刺图像是在曝光时间内连续锋利矩的累积信号,但大多数现有的单个图像脱毛方法旨在使用多个网络和训练阶段恢复固定数量的帧。为了解决这个问题,我们提出了一个基于GAN(CFMD-GAN)的连续面部运动脱毛网络,该网络是一个新颖的框架,用于恢复带有单个网络和单个训练阶段的单个运动型面部图像中潜在的连续力矩。为了稳定网络培训,我们训练发电机以通过面部特定于面部知识的面部基于面部运动的重新排序过程(FMR)确定的顺序恢复连续矩。此外,我们提出了一个辅助回归器,该回归器通过估计连续锋利的力矩来帮助我们的发电机产生更准确的图像。此外,我们引入了一个控制自适应(CONTADA)块,该块执行空间变形的卷积和频道的注意,作为控制因子的函数。 300VW数据集上的大量实验表明,所提出的框架通过改变力矩控制因子来生成各种连续的输出帧。与最近使用相同300VW训练集训练的最近的单一单击图像脱蓝色网络相比,提出的方法显示了在感知指标(包括LPIPS,FID和Arcface身份距离)方面恢复中央锋利框架的出色性能。该方法的表现优于现有的单一视频脱蓝和用于定性和定量比较的方法。
translated by 谷歌翻译
我们提出了一个混合框架Oppinn:物理知识的神经网络(PINN),其中运算符学习以近似于Fokker-Planck-Landau(FPL)方程的解决方案。 Oppinn框架分为两个步骤:步骤1和步骤2。在步骤1期间对操作员替代模型进行训练后,PINN可以使用预训练的替代模型在步骤2期间有效地近似于FPL方程。操作员替代模型可大大降低计算成本,并通过近似FPL方程中的复杂Landau碰撞积分来提高PINN。操作员替代模型也可以与传统的数值方案结合使用。当速度模式变大时,它在计算时间中提供了高效率。使用Oppinn框架,我们在各种类型的初始条件下为FPL方程提供了神经网络解决方案,并在两个和三个维度中提供相互作用模型。此外,基于FPL方程的理论属性,我们表明,随着预定义的损耗函数的降低,近似的神经网络解决方案会收敛到FPL方程的先验经典解。
translated by 谷歌翻译
半监督学习(SSL)的最新最新方法将一致性正则化与基于置信的伪标记结合在一起。为了获得高质量的伪标签,通常采用高置信度阈值。但是,已经表明,对于远离训练数据的样本,深网的基于软磁性的置信度得分可能很高,因此,即使是高信心不明的样品,伪标签也可能仍然不可靠。在这项工作中,我们提出了伪标记的新观点:而不是依靠模型信心,而是衡量未标记的样本是否可能是“分布”;即,接近当前的培训数据。为了对未标记的样本进行分类是“分布”还是“分发”,我们采用了分布外检测文献中的能量评分。随着培训的进行进展,更不标记的样品成为分配并有助于培训,标记和伪标记的数据可以更好地近似于真正的分布以改善模型。实验表明,我们的基于能量的伪标记方法,尽管从概念上讲简单,但在不平衡的SSL基准测试方面显着优于基于置信的方法,并在类平衡的数据上实现了竞争性能。例如,当不平衡比率高于50时,它会在CIFAR10-LT上产生4-6%的绝对准确性提高。当与最新的长尾SSL方法结合使用时,可以实现进一步的改进。
translated by 谷歌翻译